[SPARK-54462][SQL] Add isDataFrameWriterV1 option for Delta datasource compatibility
#53173
base: master
Conversation
Thank you for reporting the issue, @juliuszsompolski. I have a few questions.
- Which Apache Spark preview version and Delta version did you test this with? Specifically, with which preview did this start to happen?
- Why don't we implement this in io.delta.sql.DeltaSparkSessionExtension, or document it, instead of changing the Apache Spark source code?
- Do you think we can have test coverage with a dummy data source?
- If this is an emergency fix, what would be the non-emergency fix?
This is an emergency fix to prevent a breaking change resulting in data corruption with Delta data sources in Spark 4.1.
```scala
    serde = None,
    external = false,
    constraints = Seq.empty)
  val writeOptions = if (source == "delta") {
```
Does the Apache Spark source code have this kind of Delta-specific logic anywhere, @juliuszsompolski?
This looks like the first proposal to put a third-party company's data source into the Apache Spark source code. At first glance, this string match looks a little fragile to me.
```scala
  // To retain compatibility of the Delta datasource with Spark 4.1 in connect mode, Spark
  // provides this explicit storage option to indicate to Delta datasource that this call
  // is coming from DataFrameWriter V1.
  //
```
Per:
```scala
  // FIXME: Since the details of the documented semantics of Spark's DataFrameWriter V1
  // saveAsTable API differs from that of CREATE/REPLACE TABLE AS SELECT, Spark should
  // not be reusing the exact same logical plan for these APIs.
  // Existing Datasources which have been implemented following Spark's documentation of
  // these APIs should have a way to differentiate between these APIs.
```
Why don't we just always append the option? The downstream data sources that care about this behaviour will make the change accordingly.
What changes were proposed in this pull request?
Make DataFrameWriter saveAsTable add a write option `isDataFrameWriterV1 = true` when using Overwrite mode with a `delta` data source. This is an emergency fix to prevent a breaking change resulting in data corruption with Delta data sources in Spark 4.1.
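A minimal sketch of the proposed logic (the helper name `buildWriteOptions` is illustrative; the actual patch edits `DataFrameWriter.saveAsTable` directly):

```scala
import org.apache.spark.sql.SaveMode

// Sketch, simplified from the patch: when the V1 writer overwrites a table
// through the delta source, an extra write option tags the call so the
// datasource can tell it apart from DataFrameWriterV2's createOrReplace.
def buildWriteOptions(
    source: String,
    mode: SaveMode,
    extraOptions: Map[String, String]): Map[String, String] = {
  if (source == "delta" && mode == SaveMode.Overwrite) {
    extraOptions + ("isDataFrameWriterV1" -> "true")
  } else {
    extraOptions
  }
}
```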
Why are the changes needed?
Spark's SaveMode.Overwrite is documented as:
> Overwrite mode means that when saving a DataFrame to a data source, if data/table already exists, existing data is expected to be overwritten by the contents of the DataFrame.
It does not define the behaviour of overwriting the table metadata (schema, etc.). The Delta datasource's interpretation of this DataFrameWriter V1 API documentation is to not replace the table schema unless the Delta-specific option `overwriteSchema` is set to `true`.
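For example, with the V1 writer a user must opt in to schema replacement explicitly (`df` and the table name `events` are illustrative):

```scala
// With the V1 writer, Overwrite replaces the data; the Delta-specific
// overwriteSchema option must be set to also replace the table schema.
df.write
  .format("delta")
  .mode("overwrite")
  .option("overwriteSchema", "true")
  .saveAsTable("events")
```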
However, DataFrameWriter V1 creates a ReplaceTableAsSelect plan, which is the same plan created by the DataFrameWriterV2 createOrReplace API, which is documented as:
> Create a new table or replace an existing table with the contents of the data frame. The output table's schema, partition layout, properties, and other configuration will be based on the contents of the data frame and the configuration set on this writer. If the table exists, its configuration and data will be replaced.
Therefore, for calls via DataFrameWriter V2 createOrReplace, the metadata always needs to be replaced, and the Delta datasource does not use the `overwriteSchema` option.
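The equivalent V2 call replaces the schema unconditionally, with no option involved:

```scala
// DataFrameWriterV2: createOrReplace is documented to replace the table's
// schema and configuration based on the contents of the data frame.
df.writeTo("events")
  .using("delta")
  .createOrReplace()
```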
Since the created plan is exactly the same, Delta had been using a very ugly hack to detect which API the call is coming from, based on the stack trace of the call.
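For illustration, such a stack-trace check might look like this (a hypothetical sketch, not Delta's actual code):

```scala
// Hypothetical sketch: detect the V1 caller by walking the current stack.
// This only works while the plan is created and executed in the same JVM
// call chain, which is exactly what connect mode breaks.
def calledFromDataFrameWriterV1: Boolean =
  Thread.currentThread().getStackTrace.exists { frame =>
    frame.getClassName.endsWith("DataFrameWriter") &&
      frame.getMethodName == "saveAsTable"
  }
```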
In Spark 4.1 in connect mode, this stopped working: planning and execution of commands are decoupled, so the stack trace no longer contains the point where the plan was created.
To retain compatibility of the Delta datasource with Spark 4.1 in connect mode, Spark provides this explicit storage option to indicate to the Delta datasource that the call is coming from DataFrameWriter V1.
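On the datasource side, the option could then be consumed roughly like this (a sketch with illustrative names, not Delta's actual implementation):

```scala
// Hypothetical sketch: decide whether a replace should also replace the schema.
// V2 createOrReplace always replaces it; V1 saveAsTable only does so when the
// Delta-specific overwriteSchema option is set.
def shouldReplaceSchema(writeOptions: Map[String, String]): Boolean = {
  val isV1Writer = writeOptions.getOrElse("isDataFrameWriterV1", "false").toBoolean
  !isV1Writer || writeOptions.getOrElse("overwriteSchema", "false").toBoolean
}
```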
Followup: since the documented semantics of Spark's DataFrameWriter V1 saveAsTable API differ from those of CREATE/REPLACE TABLE AS SELECT, Spark should not be reusing the exact same logical plan for these APIs.
Existing datasources that have been implemented following Spark's documentation of these APIs should have a way to differentiate between them.
However, at this point this is an emergency fix: releasing Spark 4.1 as-is would cause data corruption with Delta in DataFrameWriter saveAsTable in overwrite mode, because Delta's `overwriteSchema` option would not be interpreted correctly.
Does this PR introduce any user-facing change?
No
How was this patch tested?
It has been tested with tests that are not part of this PR. Proper testing in connect mode requires changes on both the Spark and Delta sides; integrating them will be done as follow-up work.
Was this patch authored or co-authored using generative AI tooling?
Assisted by Claude Code.
Generated-by: claude code, model sonnet 4.5